
    Surprises in High-Dimensional Ridgeless Least Squares Interpolation

    Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ R^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ R^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ R^d, W ∈ R^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
    Comment: 68 pages; 16 figures. This revision contains non-asymptotic version of earlier results, and results for general coefficient
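The ridgeless estimator the abstract describes can be sketched in a few lines: in the overparametrized regime (p > n) the minimum ℓ2-norm interpolator is given by the Moore–Penrose pseudoinverse, X⁺y. The following is a minimal illustration under simplifying assumptions (identity covariance Σ, Gaussian features), not the paper's full setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # overparametrized: p > n
Z = rng.standard_normal((n, p))
Sigma_half = np.eye(p)               # identity covariance for simplicity
X = Z @ Sigma_half                   # x_i = Sigma^{1/2} z_i
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.1 * rng.standard_normal(n)

# Minimum l2-norm interpolator: beta_hat = X^+ y (Moore-Penrose pseudoinverse).
beta_hat = np.linalg.pinv(X) @ y

# It interpolates: zero training error up to numerical tolerance.
assert np.allclose(X @ beta_hat, y)

# Any other interpolator differs by a null-space vector and has larger norm,
# since beta_hat lies in the row space of X.
v = (np.eye(p) - np.linalg.pinv(X) @ X) @ rng.standard_normal(p)
assert np.linalg.norm(beta_hat) <= np.linalg.norm(beta_hat + v) + 1e-10
```

Sweeping p/n for such a fit and plotting test risk is one way to reproduce the double-descent curve the paper analyzes.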

    Elastic Net Regularization Paths for All Generalized Linear Models

    The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression, and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
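The algorithmic core referenced here is cyclic coordinate descent on the elastic net objective (1/2n)||y − Xb||² + λ(α||b||₁ + (1−α)/2 ||b||₂²). A bare-bones sketch (assuming standardized columns; the actual glmnet implementation adds warm starts, screening, and GLM working responses):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha=0.5, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1
    + (1-alpha)/2*||b||_2^2), assuming columns of X are standardized
    (mean zero, x_j^T x_j = n)."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                            # current residual y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            rho = (X[:, j] @ r) / n + b[j]  # partial-residual correlation
            b_new = soft_threshold(rho, lam * alpha) / (1 + lam * (1 - alpha))
            r -= X[:, j] * (b_new - b[j])
            b[j] = b_new
    return b

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)              # standardize columns
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)
# A (cold-started) regularization path over decreasing lambda.
path = [elastic_net_cd(X, y, lam) for lam in [1.0, 0.1, 0.01]]
```

In practice the path is computed with warm starts, solving for each λ starting from the previous solution, which is what makes the full path nearly as cheap as a single fit.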

    Strong rules for discarding predictors in lasso-type problems

    We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof but rarely fail in practice. These can be complemented with simple checks of the Karush–Kuhn–Tucker (KKT) conditions to provide safe rules that offer substantial speed and space savings in a variety of statistical convex optimization problems.
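For the plain lasso (1/2)||y − Xb||² + λ||b||₁, the basic strong rule takes a simple form: discard predictor j when |x_jᵀy| < 2λ − λ_max, where λ_max = max_j |x_jᵀy| is the smallest λ giving the all-zero solution. A hedged sketch (the paper's sequential rule, applied along a λ path, uses residuals at the previous λ instead of y):

```python
import numpy as np

def strong_rule_discard(X, y, lam, lam_max=None):
    """Basic strong rule for the lasso (1/2)||y - Xb||^2 + lam*||b||_1:
    flag predictor j as probably inactive at lam when |x_j^T y| < 2*lam - lam_max.
    Not safe in general, so the fitted solution should be confirmed with a
    KKT check, refitting with any violators added back."""
    scores = np.abs(X.T @ y)
    if lam_max is None:
        lam_max = scores.max()   # smallest lam with the all-zero solution
    return scores < 2 * lam - lam_max

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
y = 3.0 * X[:, 0] + rng.standard_normal(100)
lam_max = np.abs(X.T @ y).max()
discard = strong_rule_discard(X, y, 0.9 * lam_max, lam_max)
# The strongly correlated predictor 0 survives; most others are screened out.
```

Fitting only on the surviving columns, then verifying the KKT conditions |x_jᵀ(y − Xb̂)| ≤ λ on the discarded ones, is what turns this fast-but-unsafe screen into a safe procedure.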

    Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation

    Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, a model distillation and comparison approach to audit such models. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk scores assigned by black-box models. We compare the student model trained with distillation to a second un-distilled transparent model trained on ground-truth outcomes, and use differences between the two models to gain insight into the black-box model. Our approach can be applied in a realistic setting, without probing the black-box model API. We demonstrate the approach on four public data sets: COMPAS, Stop-and-Frisk, Chicago Police, and Lending Club. We also propose a statistical test to determine if a data set is missing key features used to train the black-box model. Our test finds that the ProPublica data is likely missing key feature(s) used in COMPAS.
    Comment: Camera-ready version for AAAI/ACM AIES 2018. Data and pseudocode at https://github.com/shftan/auditblackbox. Previously titled "Detecting Bias in Black-Box Models Using Transparent Model Distillation". A short version was presented at NIPS 2017 Symposium on Interpretable Machine Learning
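The two-student comparison can be sketched with toy data. Here the "black box" is a hypothetical scorer we only observe through its outputs, and both students are simple linear models (the paper uses more expressive transparent models); the per-feature gap between the two students is the kind of signal an audit would examine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.standard_normal((n, d))

# Hypothetical black-box risk scorer: observable only through its scores.
def black_box_score(X):
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] ** 2

scores = black_box_score(X)
outcomes = scores + rng.standard_normal(n)   # noisy ground-truth proxy

A = np.column_stack([X, np.ones(n)])         # linear students with intercept
student_distilled, *_ = np.linalg.lstsq(A, scores, rcond=None)   # mimics teacher
student_outcome, *_ = np.linalg.lstsq(A, outcomes, rcond=None)   # fits outcomes

# Where the distilled student departs from the outcome-trained student,
# the black box may be using a feature differently than the ground truth supports.
gap = student_distilled[:d] - student_outcome[:d]
```

Because both students share one transparent model class, their coefficient-level differences are directly comparable, which is the point of the distill-and-compare design.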

    Comparing spatial patterns of marine vessels between vessel-tracking data and satellite imagery

    Monitoring marine use is essential to effective management but is extremely challenging, particularly where capacity and resources are limited. To overcome these limitations, satellite imagery has emerged as a promising tool for monitoring marine vessel activities that are difficult to observe through publicly available vessel-tracking data. However, the broader use of satellite imagery is hindered by the lack of a clear understanding of where and when it would bring novel information to existing vessel-tracking data. Here, we outline an analytical framework to (1) automatically detect marine vessels in optical satellite imagery using deep learning and (2) statistically contrast geospatial distributions of vessels with the vessel-tracking data. As a proof of concept, we applied our framework to the coastal regions of Peru, where vessels without the Automatic Information System (AIS) are prevalent. Quantifying differences in spatial information between disparate datasets—satellite imagery and vessel-tracking data—offers insight into the biases of each dataset and the potential for additional knowledge through data integration. Our study lays the foundation for understanding how satellite imagery can complement existing vessel-tracking data to improve marine oversight and due diligence.

    Gene Expression Programs of Human Smooth Muscle Cells: Tissue-Specific Differentiation and Prognostic Significance in Breast Cancers

    Smooth muscle is present in a wide variety of anatomical locations, such as blood vessels, various visceral organs, and hair follicles. Contraction of smooth muscle is central to functions as diverse as peristalsis, urination, respiration, and the maintenance of vascular tone. Despite the varied physiological roles of smooth muscle cells (SMCs), we possess only a limited knowledge of the heterogeneity underlying their functional and anatomic specializations. As a step toward understanding the intrinsic differences between SMCs from different anatomical locations, we used DNA microarrays to profile global gene expression patterns in 36 SMC samples from various tissues after propagation under defined conditions in cell culture. Significant variations were found between the cells isolated from blood vessels, bronchi, and visceral organs. Furthermore, pervasive differences were noted within the visceral organ subgroups that appear to reflect the distinct molecular pathways essential for organogenesis as well as those involved in organ-specific contractile and physiological properties. Finally, we sought to understand how this diversity may contribute to SMC-involving pathology. We found that a gene expression signature of the responses of vascular SMCs to serum exposure is associated with a significantly poorer prognosis in human cancers, potentially linking vascular injury response to tumor progression.